Regular expressions and requests

The last lecture concentrated on the basic usage of regular expressions when dealing with strings in Python or a locally imported .txt file. This lecture introduces a new library called requests, which allows us to send a request to a given URL and receive the HTML source of the page as an output/response. The output/response can then be used for scraping purposes, and that is where RegEx may come in handy. This lecture also introduces some nice methods for working with JSON files.


In [1]:
import re
import requests

As you may remember, when introducing JSON files, we got acquainted with an API that provides data on the people currently in space. The data was given, as in most cases with APIs, in JSON format.


In [2]:
url = "http://api.open-notify.org/astros.json"

In order to get the data from the page, we will use the get() function from the requests library.


In [3]:
response = requests.get(url)

The response was saved in a variable, which we decided to call simply response (but, of course, you may call it anything you want). If you check the type, you will see that the response variable has a special type called Response.


In [4]:
type(response)


Out[4]:
requests.models.Response

I believe you may have at least once experienced an error when trying to reach some website. Probably the most popular one is "404 - Not Found". Those are standardized status codes used when dealing with websites (more precisely, when dealing with HTTP requests). The full list of HTTP status codes can be found on Wikipedia. One of the codes is "200 - OK", which basically tells us that everything went well. In our case, if we were able to successfully receive the data from the URL above, then we should receive the 200 status code. This can be seen by just printing the response.


In [5]:
print(response)


<Response [200]>
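The status code is also available directly as an attribute of the Response object. A minimal sketch, using the status_code and ok attributes and the raise_for_status() method that requests provides:


print(response.status_code)   # 200 when the request succeeded
print(response.ok)            # True for any status code below 400
response.raise_for_status()   # raises an HTTPError for 4xx/5xx codes, does nothing on success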

As you have seen above, printing the response results in the status code (e.g. OK, an error or anything else) but not in the content of the website. To receive the content/data, we must use some of the methods the requests library makes available on the Response object. In our case, as the data is given in JSON format, we will use the json() method, which can be called directly on our response object/variable.


In [6]:
data = response.json()

In [7]:
type(data)


Out[7]:
dict

As you can see above, the type of our data variable, which holds the data received in JSON format, is a usual Python dictionary. Let's print it to see the content.


In [8]:
print(data)


{u'message': u'success', u'number': 3, u'people': [{u'craft': u'ISS', u'name': u'Peggy Whitson'}, {u'craft': u'ISS', u'name': u'Fyodor Yurchikhin'}, {u'craft': u'ISS', u'name': u'Jack Fischer'}]}

When printing a dictionary, especially one with nested values, the output is not very readable and user friendly. For that purpose, one may use a standard library module to make printing pretty. The pretty-printing module is called pprint and the function we need from it is again called pprint.


In [9]:
from pprint import pprint

In [10]:
pprint(data)


{u'message': u'success',
 u'number': 3,
 u'people': [{u'craft': u'ISS', u'name': u'Peggy Whitson'},
             {u'craft': u'ISS', u'name': u'Fyodor Yurchikhin'},
             {u'craft': u'ISS', u'name': u'Jack Fischer'}]}

Now, the content is the same and the type of the variable is the same, yet the printed output is much more readable. As you can see, there are currently only 3 people in space.

You may wonder why we have those u-s in front of each string. The u stands for unicode and indicates the encoding of the string. It is fine that you see it, but end users will not, so you may simply ignore that letter.
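As a side note, similar pretty printing can be achieved with the standard json module: json.dumps() with its indent argument serializes the dictionary into an indented JSON string (which, incidentally, contains no u prefixes, since it is serialized text rather than a printed Python object):


import json

print(json.dumps(data, indent=4))   # an indented JSON string representation of the dictionary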

Let's now move on to RegEx. Regular expressions deal with text, however our dataset is a JSON response loaded into Python as a dictionary. In order to apply regular expressions to it, it is necessary to convert the dictionary (or the part of it that interests us) to a simple string.


In [11]:
converted = str(data)

In [12]:
type(converted)


Out[12]:
str

In [13]:
print(converted)


{u'message': u'success', u'number': 3, u'people': [{u'craft': u'ISS', u'name': u'Peggy Whitson'}, {u'craft': u'ISS', u'name': u'Fyodor Yurchikhin'}, {u'craft': u'ISS', u'name': u'Jack Fischer'}]}

As before, the printed result is not very readable, but what do you think the pprint() function will produce?


In [14]:
pprint(converted)


"{u'message': u'success', u'number': 3, u'people': [{u'craft': u'ISS', u'name': u'Peggy Whitson'}, {u'craft': u'ISS', u'name': u'Fyodor Yurchikhin'}, {u'craft': u'ISS', u'name': u'Jack Fischer'}]}"

The latter produces almost the same output, again not very user friendly. The thing is that now our variable of interest (converted) is not a dictionary, which makes it impossible for pprint to understand what the keys are and what the values are.

Anyway, let's apply RegEx to that non-user-friendly string. The first method we will apply from the re package is again findall. As the only numeric object in the text is the number of humans currently in space, we can easily learn that number just by searching for a digit in the whole output text.


In [15]:
output = re.findall('[0-9]',converted)
print(output)


['3']
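Note that [0-9] matches a single digit at a time: had there been, say, 12 people in space, we would have received ['1', '2'] rather than ['12']. Adding the + quantifier (one or more) makes the pattern match whole numbers, a small sketch:


output = re.findall('[0-9]+', converted)   # one or more consecutive digits
print(output)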

Let's now extract the full names of the "spacemen". The first letters of the names and surnames will be the only uppercase letters in the text that are immediately followed by lowercase letters. There will also be a single whitespace character between a name and a surname.


In [16]:
output = re.findall('[A-Z][a-z]+\s[A-Z][a-z]+',converted)
print(output)


['Peggy Whitson', 'Fyodor Yurchikhin', 'Jack Fischer']

Please note that we were lucky there was no Spanish or Mexican astronaut in space, as they might have a very long full name consisting of several elements separated by several whitespaces (e.g. Maria Teresa García Ramírez de Arroyo).
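Of course, since the data is already available as a dictionary, the same names can be extracted reliably without regular expressions, simply by indexing into the nested structure:


names = [person['name'] for person in data['people']]   # the 'people' key holds a list of dictionaries
print(names)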

Let's now move back to text files. Last time we went to the Project Gutenberg webpage and found the Theodore Dreiser book titled "The Financier". The book was downloaded and then imported into Python. This time, we will again use that book to apply regular expressions, yet we will not download it manually; rather, we will read it directly from the URL.


In [17]:
url_book = "https://www.gutenberg.org/files/1840/1840-0.txt"

In [18]:
response_book = requests.get(url_book)

The response we received about the people in space was in JSON format. This time it is plain text (a .txt file). The correct tool here is the text attribute of the Response object, which provides the decoded text of the received response.


In [19]:
financier_online = response_book.text

If you check the type of the variable above (financier_online), you will see unicode as the outcome, meaning the received text has already been decoded into a Unicode string.


In [20]:
type(financier_online)


Out[20]:
unicode
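The encoding that requests used to decode the raw bytes into that Unicode string can be inspected (and, if needed, overridden) through the encoding attribute of the Response object; the exact value depends on the headers sent by the server:


print(response_book.encoding)   # e.g. 'utf-8', as declared by the server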

As the whole book was received as a string, we do not want to print all of it, so let's print part of it.


In [21]:
financier_online[70:205]


Out[21]:
u'This eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r'

Let's again see how many times Mr. Dreiser uses the $ sign in his book.


In [22]:
output = re.findall("\$",financier_online)
print(output)


[u'$', u'$', u'$']
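Since findall() returns a list of all matches, the count itself is simply the length of that list:


print(len(output))   # 3 occurrences of the $ sign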

As you can see, the "u" letter again appeared in front of each string inside the list. But, as mentioned before, this will not affect the results and will not be visible to the end user. To feel safer, one can prove this to themselves by printing a single element from the output list and seeing that the "u" disappears.


In [23]:
print(output[0])


$

Let's see how many times and on what occasions the symbol "@" was used.


In [24]:
output = re.findall("@",financier_online)
print(output)


[u'@', u'@']

In [25]:
output = re.findall("\S+@\S+",financier_online)
print(output)


[u'business@pglaf.org.', u'gbnewby@pglaf.org']
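Notice that the first match dragged along the period that ends the sentence (business@pglaf.org.). A slightly stricter sketch (not a fully general e-mail pattern) that allows only word characters, dots and hyphens, and requires the address to end with word characters:


output = re.findall(r'[\w.-]+@[\w.-]+\.\w+', financier_online)   # stops before trailing punctuation
print(output)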

Well then, let's move from The Financier to my webpage and get its HTML source.


In [26]:
url_Hrant = "https://hrantdavtyan.github.io/"

In [27]:
response_Hrant = requests.get(url_Hrant)

As before, we are interested in the text content. An HTML document is nothing but text, so we can use the same text attribute to get the data.


In [28]:
my_page = response_Hrant.text

In [29]:
my_page[:500]


Out[29]:
u'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">\r\n\r\n<head>\r\n    \r\n    <title>Hrant Davtyan</title>\r\n    <meta http-equiv="content-type" content="text/html; charset=utf-8"/>\r\n    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\r\n    <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">\r\n    <link rel="stylesheet" type="text/css" href="css/reset.'

Let's now try to find my e-mail address on the website.


In [30]:
output = re.findall('\S+@\S+',my_page)
print(output)


[u'<li><label>Email</label><span>hdavtyan@aua.am</span></li>', u'hdavtyan@aua.am</li>']

It is fine, we were able to find my e-mail (twice!); however, unlike in a simple .txt document, here we do not have a lot of whitespace because almost everything is inside HTML tags. This means we should actually get rid of the tags using RegEx. There are many approaches that can be used. One of them is to substitute all the tag elements with a whitespace.


In [31]:
output = re.findall('\S+@\S+',my_page)
print(re.sub(r'<.*?>'," ",str(output)))


[u'  Email  hdavtyan@aua.am  ', u'hdavtyan@aua.am ']

Much better, but we still get the "Email" text before the actual e-mail. The thing is that we replaced the tags only in the output, while it would have been much more beneficial to replace them in the whole text first and only then search for the e-mail match inside the new, tag-free text.


In [34]:
page_no_tags = re.sub(r'<.*?>'," ",my_page)
print(re.findall('\S+@\S+',page_no_tags))


[u'hdavtyan@aua.am', u'hdavtyan@aua.am']

Now it works. But this does not mean you should always drop the tags (e.g. replace them with a space). Sometimes tags can be useful for finding exactly the text you are interested in. For example, the <a> tag or the href="" attribute can be useful when searching for hyperlinks.


In [36]:
output = re.findall('<a href\s*=\s*"(.+)"\s*>',my_page)
print(output)


[u'#" class="social-text', u'https://www.facebook.com/HrantDavtyan" class="social-facebook', u'https://www.linkedin.com/in/hrantdavtyan" class="social-in', u'#profile" class="tab-profile', u'#resume" class="tab-resume', u'#portfolio" class="tab-portfolio', u'#contact" class="tab-contact', u'http://aua.am/', u'http://isifa.am/en/ifa-dfi/', u'https://www.cerge-ei.cz/', u'http://www.fao.org/armenia/en/', u'http://www.metric.am/', u'http://ahpc.am/?lang=en', u'http://www.ucl.ac.uk/', u'http://iset.tsu.ge/', u'" class="current" data-filter="*', u'" data-filter=".statistics', u'" data-filter=".programming', u'" data-filter=".teaching', u'" data-filter=".survey', u'" data-filter=".other', u'https://github.com/HrantDavtyan/HrantDavtyan.github.io\\teaching\\jdocs\\Business Analytics\\html\\index.html" rel="portfolio" target="_new', u'http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png" title="Damage and Loss calculator" rel="portfolio" class="folio', u'http://www.armstat.am/en/?nid=661" rel="portfolio" class="folio iframe', u'http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png" rel="portfolio" class="folio', u'http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png" rel="portfolio" class="folio', u'http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png" rel="portfolio" class="folio', u'http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png" rel="portfolio" class="folio', u'http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png" rel="portfolio" class="folio']

It seems we successfully received all the links inside an <a> tag (but not the links inside a <link> tag). Let's check how many of them were received.


In [37]:
len(output)


Out[37]:
29

Excellent! If you manually open my page and look for the "href" attribute, you will find 33 occurrences, 4 of which are inside a <link> tag. So we correctly received all 29 necessary elements. But as you can see, we also received some text coming after the " sign. The reason for this situation is greediness. As we discussed last time, some expressions are greedy and try to match as much as they can, so here the pattern matched up to the last possible " sign. Also, the matched URLs include some empty ones or ones that contain only a # sign. We need to keep only URLs that are non-empty, i.e. consist of non-whitespace characters.
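By the way, the standard cure for greediness is the lazy quantifier: appending ? to .+ makes it match as little as possible, stopping at the first closing quote rather than the last one. A sketch:


output = re.findall(r'<a href\s*=\s*"(.+?)"', my_page)   # .+? is the non-greedy version of .+

Below, however, we achieve a similar effect by restricting the captured group to non-whitespace characters.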


In [51]:
output = re.findall(r'<a href\s*=\s*"(\S+)"\s*>',my_page)
print(output)


[u'http://aua.am/', u'http://isifa.am/en/ifa-dfi/', u'https://www.cerge-ei.cz/', u'http://www.fao.org/armenia/en/', u'http://www.metric.am/', u'http://ahpc.am/?lang=en', u'http://www.ucl.ac.uk/', u'http://iset.tsu.ge/']

Fine, now we have only those URLs whose closing quote is followed by nothing but the > sign. This is the reason there is no place for the Facebook or LinkedIn links, etc. To get them as well, we just need to replace the whitespace expression (\s*) between the closing quote and the > sign with any character (.*).


In [52]:
output = re.findall(r'<a href\s*=\s*"(\S+)".*>',my_page)
print(output)


[u'#', u'https://www.facebook.com/HrantDavtyan', u'https://www.linkedin.com/in/hrantdavtyan', u'#profile', u'#resume', u'#portfolio', u'#contact', u'http://aua.am/', u'http://isifa.am/en/ifa-dfi/', u'https://www.cerge-ei.cz/', u'http://www.fao.org/armenia/en/', u'http://www.metric.am/', u'http://ahpc.am/?lang=en', u'http://www.ucl.ac.uk/', u'http://iset.tsu.ge/', u'http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png', u'http://www.armstat.am/en/?nid=661', u'http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png', u'http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png', u'http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png', u'http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png', u'http://clearwaterbeachassoc.com/wp-content/uploads/2016/01/Under-construction.png']
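If the in-page anchors (the entries consisting of or starting with the # sign) are unwanted, they can be filtered out afterwards, for instance with a simple list comprehension:


links = [link for link in output if not link.startswith('#')]   # keep only real URLs
print(links)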